# spicedb
t
I have a question regarding the performance of `CheckPermission`: I'm testing this schema with some Python scripts (SpiceDB + Postgres running in Docker): https://play.authzed.com/s/NVTEhOOCjAPS/schema The test data is the same as in the playground, but I also added 1,000 "knowledgebases" to object "org1" with relation "owner_org" and 999,000 "knowledgebases" to object "org2" with relation "owner_org". (This was done to test the performance of LookupResources.) I'm running the 12 permission checks defined in the playground and I'm surprised by the variation in execution time.
Copy code
start_time_ms = time.time() * 1000
resp = client.CheckPermission(
    CheckPermissionRequest(
        #consistency=consistency,
        resource=kb1,
        permission="view",
        subject=user1,
    )
)
end_time_ms = time.time() * 1000
print(f"Duration {end_time_ms - start_time_ms} ms")
...
The lowest execution time I've seen was < 1 ms, the highest was > 100 ms. These are my last 2 runs (same order of checks):
Copy code
Duration 7.9990234375 ms
Duration 1.00146484375 ms
Duration 47.999267578125 ms
Duration 98.03125 ms
Duration 50.020263671875 ms
Duration 54.95703125 ms
Duration 44.98779296875 ms
Duration 50.00244140625 ms
Duration 49.06201171875 ms
Duration 51.076171875 ms
Duration 49.76171875 ms
Duration 49.468505859375 ms
Copy code
Duration 10.9677734375 ms
Duration 2.00048828125 ms
Duration 1.001220703125 ms
Duration 3.025634765625 ms
Duration 0.965576171875 ms
Duration 0.977294921875 ms
Duration 46.08935546875 ms
Duration 0.98779296875 ms
Duration 1.041259765625 ms
Duration 47.974365234375 ms
Duration 52.947265625 ms
Duration 3.998779296875 ms
Are those deviations expected? Or does that have something to do with my setup? I thought the same check would take about the same amount of time each run if nothing changed. I also thought using consistency=Consistency(fully_consistent=True) would push those durations to ~50 ms or more every time, but I did not notice any difference.
v
This is expected. Data is cached in memory for a given amount of time, defined by the so-called quantization window. If the subproblems needed are not part of the current quantization window, the server has to go and recompute them. The same applies to fully-consistent: it also uses the cache if the head revision is in memory. The only difference is how the revision is computed; if no writes have taken place, there is no new revision.
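To illustrate, the consistency options on the client side look roughly like this (just a sketch; the request variables are placeholders):
Copy code
from authzed.api.v1 import (
    CheckPermissionRequest,
    Consistency,
    ObjectReference,
    SubjectReference,
)

kb1 = ObjectReference(object_type="knowledgebase", object_id="kb1")
user1 = SubjectReference(object=ObjectReference(object_type="user", object_id="user1"))

# minimize_latency (the default behaviour): reuse the revision picked for the
# current quantization window, so repeated subproblems can be served from cache.
req = CheckPermissionRequest(
    consistency=Consistency(minimize_latency=True),
    resource=kb1, permission="view", subject=user1,
)

# fully_consistent: always evaluate at the head revision. If no writes happened,
# the head revision is unchanged and cached subproblems are still used, which is
# why the timings barely differ.
req = CheckPermissionRequest(
    consistency=Consistency(fully_consistent=True),
    resource=kb1, permission="view", subject=user1,
)

# at_least_as_fresh: evaluate at or after a ZedToken returned by a previous
# write/check (not used in your test, shown for completeness).
# req = CheckPermissionRequest(consistency=Consistency(at_least_as_fresh=token), ...)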
Are those checks supposed to go through the path with the 999K elements?
Or are you testing a different part of the graph?
t
> Data is cached in memory for a given amount of time, defined by the so-called quantization window.
That's 5 seconds by default, right? When I do the same checks 3 times, performance does not seem to improve, though.
Copy code
Started at 2024-02-21 14:44:13.707322
Duration 9.001953125 ms
Duration 1.998779296875 ms
Duration 1.00048828125 ms
Duration 3.001220703125 ms
Duration 0.998046875 ms
Duration 48.5283203125 ms
Duration 50.00048828125 ms
Duration 51.132080078125 ms
Duration 49.356201171875 ms
Duration 50.0 ms
Duration 51.000244140625 ms
Duration 49.0947265625 ms

Started at 2024-02-21 14:44:14.071434
Duration 1.958984375 ms
Duration 47.00390625 ms
Duration 50.943603515625 ms
Duration 100.94140625 ms
Duration 49.697509765625 ms
Duration 50.099853515625 ms
Duration 51.009033203125 ms
Duration 48.994140625 ms
Duration 50.000732421875 ms
Duration 51.232666015625 ms
Duration 48.90966796875 ms
Duration 49.99853515625 ms

Started at 2024-02-21 14:44:14.621280
Duration 49.306884765625 ms
Duration 51.226806640625 ms
Duration 49.12548828125 ms
Duration 100.440185546875 ms
Duration 48.993896484375 ms
Duration 50.0966796875 ms
Duration 50.278076171875 ms
Duration 48.32568359375 ms
Duration 52.20556640625 ms
Duration 48.078125 ms
Duration 51.133056640625 ms
Duration 48.33154296875 ms
The checks are in this order:
Copy code
1. resource=kb1, permission="view", subject=user1
2. resource=kb1, permission="view", subject=user2
3. resource=kb1, permission="view", subject=user3
4. resource=kb1, permission="view", subject=admin1
5. resource=kb1, permission="edit", subject=user1
6. resource=kb1, permission="edit", subject=user2
7. resource=kb1, permission="edit", subject=user3
8. resource=kb1, permission="edit", subject=admin1
9. resource=org1, permission="membership", subject=user1
10. resource=org1, permission="membership", subject=user2
11. resource=org1, permission="membership", subject=user3
12. resource=org1, permission="membership", subject=admin1
The last 4 checks should have nothing to do with the 1,000,000 knowledgebases assigned to org1/org2.
Copy code
definition organization {
    relation admin: user
    relation member: user
    permission membership = admin + member
}
But as you can see, they still take 50+ ms a lot of the time.
v
There may be something off happening for wide relations. I'd suggest visualizing the OpenTelemetry traces to see what's going on. You can also do a zed backup create of your instance and send it our way.
t
I'm trying to set up tracing. Just want to mention the links to "global flags" here are broken: https://authzed.com/docs/spicedb/ops/observability
I think I have to set --otel-endpoint="", --otel-insecure and --otel-provider="otlpgrpc"?
v
correct
t
Got it to work like this:
--otel-endpoint="jaeger:4317" --otel-provider="otlpgrpc" --otel-sample-ratio 1 --otel-insecure
Copy code
Started at 2024-02-22 11:09:02.202036
Duration 51.400146484375 ms
Duration 49.38134765625 ms
Duration 49.435546875 ms
Duration 98.85498046875 ms
Duration 50.096923828125 ms
Duration 49.4140625 ms
Duration 50.287353515625 ms
Duration 49.98583984375 ms
Duration 50.41748046875 ms
Duration 50.361328125 ms
Duration 48.8427734375 ms
Duration 51.60595703125 ms
Those durations don't really match what's shown in the traces (everything at 0.x ms). Do you know where that additional delay could be coming from? Is it just network? But everything is running locally on the same machine in Docker containers... https://cdn.discordapp.com/attachments/1209844584520097812/1210178733160534107/grafik.png?ex=65e99dc6&is=65d728c6&hm=04566c7170e827f7ff00c1ba98a997e2c5333d78c533cf6b2392584e1a08d0e6&
v
Are you observing a trace of CheckPermission? Or is that perhaps something else?
t
I noticed the HTTP API is a lot faster than gRPC in Python. Python code:
Copy code
import time
from datetime import datetime

from authzed.api.v1 import (
    CheckPermissionRequest,
    Consistency,
    ObjectReference,
    SubjectReference,
)

# `client` is the authzed.api.v1.Client created earlier (connection setup omitted)
for x in range(10):
    #print(f"Started at {datetime.now()}")

    start_time_ms = time.time() * 1000
    resp = client.CheckPermission(
        CheckPermissionRequest(
            consistency=Consistency(minimize_latency=True),
            resource=ObjectReference(object_type="knowledgebase", object_id="kb1"),
            permission="view",
            subject=SubjectReference(object=ObjectReference(object_type="user", object_id="user1"))
        )
    )
    end_time_ms = time.time() * 1000
    print(f"Duration {end_time_ms - start_time_ms} ms")
Python logs:
Copy code
Duration 9.5419921875 ms
Duration 4.2685546875 ms
Duration 49.626708984375 ms
Duration 51.244140625 ms
Duration 49.075927734375 ms
Duration 50.07470703125 ms
Duration 49.827880859375 ms
Duration 49.335205078125 ms
Duration 49.7177734375 ms
Duration 49.9755859375 ms
gRPC in Python takes about 50 ms most of the time. HTTP API request in Postman:
http://localhost:8443/v1/permissions/check
Copy code
{
  "consistency": {
    "minimizeLatency": true
  },
  "resource": {
    "objectType": "knowledgebase",
    "objectId": "kb1"
  },
  "permission": "view",
  "subject": {
    "object": {
      "objectType": "user",
      "objectId": "user1"
    }
  }
}
HTTP API requests take between 4-7 ms most of the time.
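For reference, the same HTTP call from Python looks roughly like this (sketch; the preshared key is a placeholder):
Copy code
import requests

resp = requests.post(
    "http://localhost:8443/v1/permissions/check",
    headers={"Authorization": "Bearer somerandomkey"},  # preshared key: placeholder
    json={
        "consistency": {"minimizeLatency": True},
        "resource": {"objectType": "knowledgebase", "objectId": "kb1"},
        "permission": "view",
        "subject": {"object": {"objectType": "user", "objectId": "user1"}},
    },
)
print(resp.json()["permissionship"])  # e.g. "PERMISSIONSHIP_HAS_PERMISSION"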
d
We're also using Python + gRPC against a self-hosted SpiceDB cluster in k8s, and we're having a frustrating time. Performance for us is mostly very fast, but we have been having gRPC issues against the self-hosted cluster. Interestingly, not against the serverless offering: https://discord.com/channels/844600078504951838/844600078948630559/1209441355655544902 Interesting to see that the HTTP API is more efficient for you. Performance is acceptable for us as is, but we might take a look at switching. The SpiceDB logs indicate ~2-10 ms processing times for most of our queries. Round-trip time from Python is definitely much larger than that, but still a small fraction of the time our Python API takes to gather metadata and respond.
t
I've written a small Python script comparing the performance of gRPC and HTTP. HTTP is a lot faster. Maybe something is wrong with the gRPC client? Output:
Copy code
--- START GRPC ---
Duration 12.93017578125 ms
Duration 1.581787109375 ms
Duration 41.468017578125 ms
Duration 49.78662109375 ms
Duration 50.671142578125 ms
Duration 50.17919921875 ms
Duration 50.543212890625 ms
Duration 50.43701171875 ms
Duration 49.665771484375 ms
Duration 51.866455078125 ms
--- END GRPC ---
--- START HTTP ---
Duration 11.78662109375 ms
Duration 4.828125 ms
Duration 6.3837890625 ms
Duration 4.811279296875 ms
Duration 3.78759765625 ms
Duration 4.275146484375 ms
Duration 5.304443359375 ms
Duration 4.394775390625 ms
Duration 6.509521484375 ms
Duration 3.7783203125 ms
--- END HTTP ---
https://cdn.discordapp.com/attachments/1209844584520097812/1210202026361888768/grpc_vs_http.py?ex=65e9b377&is=65d73e77&hm=ec5f28997eabfe6690a78d9afd03b592a2bde90b2b367c6196a76e2f5af02179&
v
That does not make sense. The HTTP API is literally hitting the gRPC API internally.
Maybe something is broken about the Python gRPC client?
Can you try with something like grpcurl?
Which version of the authzed-py client are you using?
t
Copy code
pip show authzed
Name: authzed
Version: 0.13.0
Summary: Client library for SpiceDB.
Home-page:
Author: Authzed
Author-email: support@authzed.com
License:
Location: c:\users\...\appdata\local\programs\python\python310\lib\site-packages
Requires: google-api-core, google_api, grpcio, protobuf, typing-extensions
Required-by:
Seems to be the newest version: https://pypi.org/project/authzed/
v
@torben_26072 would you mind opening an issue in authzed-py with your findings? @Duncan feel free to add any information you may have there too. @torben_26072 if you can export with zed backup create, upload it somewhere and reference it in the issue, that'd be super useful.
t
Will do that.
v
If you configured your client with OpenTelemetry, propagated the trace context, and captured traces from it too, we would be able to see where the time is being spent. Did you check the gRPC latencies reported by the SpiceDB logs? Do they show the same latencies as the client?
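Something roughly like this on the client side would do it (sketch; assumes the opentelemetry-sdk, opentelemetry-exporter-otlp and opentelemetry-instrumentation-grpc packages are installed, and that Jaeger's OTLP gRPC port 4317 is reachable from the host):
Copy code
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export client-side spans to the same Jaeger instance SpiceDB reports to,
# so the client-observed latency can be compared against the server-side spans.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# Instrument gRPC channels created after this call (build the authzed Client afterwards).
GrpcInstrumentorClient().instrument()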
t
Windows 10 22H2, Intel Core i7 13700HX
grpc.time_ms is 0.
Copy code
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"b9b9c315e73f05c0357710a083e290ad","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"7c1d3564adcdf1524b8f868699e0239a","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
2024-02-22 14:17:33 {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"c4da1ecf85a49e90167785fa4febad5e","peer.address":"172.19.0.1:35872","grpc.start_time":"2024-02-22T13:17:33Z","grpc.code":"OK","grpc.time_ms":0,"time":"2024-02-22T13:17:33Z","message":"finished call"}
Those requests measured ~50 ms in Python.
v
yikes
Are you referencing localhost when initializing the client, or 127.0.0.1?
Just an additional data point: the gRPC benchmarking dashboard shows that the Python implementation is always the slowest. QPS throughput is terrible: https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5180705743044608
d
Sorry to confuse your thread, Torben. We also see lots of grpc.time_ms:0 requests taking much longer by the time Python gets its answer, but we do have a trip over the internet and through k8s via an nginx load balancer. Our overall results are not THAT dissimilar, however: https://cdn.discordapp.com/attachments/1209844584520097812/1210219492316020787/image.png?ex=65e9c3bb&is=65d74ebb&hm=eb36d1e730ea33ff623c5fd6371dc9894c9f49e09d08a6f08ae314eda1332d24& https://cdn.discordapp.com/attachments/1209844584520097812/1210219492676993064/image.png?ex=65e9c3bb&is=65d74ebb&hm=d5288f40d554a4411a4ce7f5569753218ff54e1c8d5ff45fc662c6e0df127264&
t
Oh wow, you found the problem. I changed localhost to 127.0.0.1 and now it looks like this:
Copy code
CHECK_PERMISSION:
--- START GRPC ---
Duration 7.995361328125 ms
Duration 1.653076171875 ms
Duration 2.1259765625 ms
Duration 1.63671875 ms
Duration 1.623291015625 ms
Duration 1.56005859375 ms
Duration 1.61279296875 ms
Duration 1.6015625 ms
Duration 1.08837890625 ms
Duration 1.05615234375 ms
--- END GRPC ---
--- START HTTP ---
Duration 8.16552734375 ms
Duration 4.935546875 ms
Duration 3.897705078125 ms
Duration 4.309814453125 ms
Duration 4.350341796875 ms
Duration 4.359130859375 ms
Duration 3.88623046875 ms
Duration 3.833740234375 ms
Duration 7.128662109375 ms
Duration 3.900634765625 ms
--- END HTTP ---
That's for CheckPermission. For LookupResources the HTTP API is still a lot faster, probably because over HTTP the result (1,000 resources) is not streamed.
Copy code
LOOKUP_RESOURCES:
--- START GRPC ---
Duration 55.7783203125 ms
Duration 37.469482421875 ms
Duration 40.148193359375 ms
Duration 41.048583984375 ms
Duration 44.19091796875 ms
Duration 44.93505859375 ms
Duration 40.215576171875 ms
Duration 41.070556640625 ms
Duration 49.726806640625 ms
Duration 39.05126953125 ms
--- END GRPC ---
--- START HTTP ---
Duration 15.42919921875 ms
Duration 19.097412109375 ms
Duration 14.84326171875 ms
Duration 19.34765625 ms
Duration 17.08740234375 ms
Duration 15.32421875 ms
Duration 20.00341796875 ms
Duration 17.95166015625 ms
Duration 16.96728515625 ms
Duration 17.973876953125 ms
--- END HTTP ---
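For reference, this is roughly how the gRPC stream is consumed on my side (sketch; client setup is the same as above):
Copy code
from authzed.api.v1 import (
    Consistency,
    LookupResourcesRequest,
    ObjectReference,
    SubjectReference,
)

# LookupResources is a server-streaming RPC over gRPC, so the client iterates
# over one message per matching resource (~1,000 of them here).
stream = client.LookupResources(
    LookupResourcesRequest(
        consistency=Consistency(minimize_latency=True),
        resource_object_type="knowledgebase",
        permission="view",
        subject=SubjectReference(object=ObjectReference(object_type="user", object_id="user1")),
    )
)
ids = [item.resource_object_id for item in stream]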
d
oh, nice!
t
For me the question is now: Is there a general performance problem with Python gRPC when using a domain instead of an IP address? It's not always possible to use a static IP address.
@vroldanbet Should I still create an issue for the gRPC performance problems with Python?
v
From a cursory look online, everything points at Python's implementation of gRPC being bad, so I'm not surprised about the Lookup result, since you can read about how proto serialization/deserialization is slow in Python.
Maybe the recommendation moving forward is to tell people not to use the gRPC client with Python
and instead use the HTTP gateway.
t
But that doesn't explain the difference in gRPC performance for CheckPermission between localhost and 127.0.0.1. Maybe something can be changed there?
That depends on the size of the returned result. For 1,000 returned resources HTTP is a lot faster. For 100 returned resources there's not really any difference. For 10 resources gRPC is faster than HTTP.
v
Yeah, that I don't know. I just found a GitHub issue online that said the networking between the host and the Docker subsystem on Windows was slow for whatever reason.
Yeah, as mentioned, proto in Python is kinda slow.
t
It's possible it has something to do with Docker, but localhost:50051 (gRPC) being slow and localhost:8443 (HTTP) being fast for CheckPermission is still strange.
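One thing I can check is what localhost actually resolves to on this machine (diagnostic sketch; the IPv6-first explanation is just a guess on my part):
Copy code
import socket

# Print every address "localhost" resolves to for port 50051.
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 50051):
    print(family, sockaddr)
# Guess: if ::1 (IPv6) is listed first, the gRPC client may try IPv6 before
# falling back to 127.0.0.1, and that fallback could be slow against the
# Docker-published port on Windows/WSL2.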
v
Can you try running the python script inside of the container?
To skip the networking stack
t
Could be something like this: https://github.com/pytorch/serve/issues/1375
Good idea.
I don't really know how to get Python running in the SpiceDB container. There's no package manager or gcc.
I'm now using the Windows version of SpiceDB and the performance of localhost and 127.0.0.1 is similar. So it's most likely a Docker or WSL2 problem:
Copy code
CHECK_PERMISSION:
--- START LOCALHOST ---
Duration 7.581298828125 ms
Duration 1.0810546875 ms
Duration 1.111328125 ms
Duration 1.0556640625 ms
Duration 0.536865234375 ms
Duration 1.054931640625 ms
Duration 1.059326171875 ms
Duration 0.53173828125 ms
Duration 1.0576171875 ms
Duration 1.080078125 ms
--- END LOCALHOST ---
--- START 127.0.0.1 ---
Duration 1.719970703125 ms
Duration 1.083984375 ms
Duration 0.52734375 ms
Duration 1.134033203125 ms
Duration 0.904296875 ms
Duration 0.85009765625 ms
Duration 0.296630859375 ms
Duration 1.37548828125 ms
Duration 0.818115234375 ms
Duration 2.551513671875 ms
--- END 127.0.0.1 ---
Copy code
LOOKUP_RESOURCES:
--- START LOCALHOST ---
Duration 43.968994140625 ms
Duration 34.41650390625 ms
Duration 33.071044921875 ms
Duration 36.738525390625 ms
Duration 37.85009765625 ms
Duration 44.834716796875 ms
Duration 39.364990234375 ms
Duration 39.24169921875 ms
Duration 40.6171875 ms
Duration 35.39599609375 ms
--- END LOCALHOST ---
--- START 127.0.0.1 ---
Duration 37.03466796875 ms
Duration 37.453125 ms
Duration 38.0908203125 ms
Duration 33.736083984375 ms
Duration 31.812255859375 ms
Duration 33.2841796875 ms
Duration 34.0 ms
Duration 30.41259765625 ms
Duration 35.941650390625 ms
Duration 35.09912109375 ms
--- END 127.0.0.1 ---
v
OK, so it's Docker stuff on Windows. Good to know.
Thanks for working through this 🙇‍♂️
Useful information to have.